1 Introduction

Load packages:

library(tidyverse)
library(ggplot2) # superfluous because ggplot2 is part of tidyverse
library(scales) # for formatting labels for axes and legends

library(haven)
library(labelled)

Resources used to create this lecture:

1.1 Datasets we will use

We will use two datasets that are part of the ggplot2 package:

  • mpg: EPA fuel economy data in 1999 and 2008 for 38 car models that had a new release every year between 1999 and 2008
    • Note: There is no set of variables that uniquely identifies observations
  • diamonds: Prices and attributes of about 54,000 diamonds
#?mpg
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "aud…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattr…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8…
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", …
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, …
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, …
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact…
#?diamonds
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.2…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, …
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 6…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.0…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.0…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.3…

We will use public-use data from the National Center for Education Statistics (NCES) Education Longitudinal Study of 2002 (ELS):

  • Follows 10th graders from 2002 until 2012
  • Variable stu_id uniquely identifies observations
# variables we want to select from full ELS dataset
els_keepvars <- c(
    "STU_ID",        # student id
    "STRAT_ID",      # stratum id
    "PSU",           # primary sampling unit
    "BYRACE",        # (base year) race/ethnicity 
    "BYINCOME",      # (base year) parental income
    "BYPARED",       # (base year) parental education
    "BYNELS2M",      # (base year) math score
    "BYNELS2R",      # (base year) reading score
    "F3ATTAINMENT",  # (3rd follow up) attainment
    "F2PS1SEC",      # (2nd follow up) first institution attended
    "F3ERN2011",     # (3rd follow up) earnings from employment in 2011
    "F1SEX",         # (1st follow up) sex composite
    "F2EVRATT",      # (2nd follow up, composite) ever attended college
    "F2PS1LVL",      # (2nd follow up, composite) first attended postsecondary institution, level 
    "F2PS1CTR",      # (2nd follow up, composite) first attended postsecondary institution, control
    "F2PS1SLC"       # (2nd follow up, composite) first attended postsecondary institution, selectivity
)
els_keepvars
##  [1] "STU_ID"       "STRAT_ID"     "PSU"          "BYRACE"      
##  [5] "BYINCOME"     "BYPARED"      "BYNELS2M"     "BYNELS2R"    
##  [9] "F3ATTAINMENT" "F2PS1SEC"     "F3ERN2011"    "F1SEX"       
## [13] "F2EVRATT"     "F2PS1LVL"     "F2PS1CTR"     "F2PS1SLC"
load(url("https://github.com/anyone-can-cook/rclass2/raw/main/data/els/els.RData"))

els <- els %>%
  # keep only subset of vars (all_of() supersedes one_of() for character vectors)
  select(all_of(els_keepvars)) %>%
  # lower variable names (rename_with() supersedes rename_all())
  rename_with(tolower)

glimpse(els)
## Rows: 16,197
## Columns: 16
## $ stu_id       <dbl> 101101, 101102, 101104, 101105, 101106, 101107, 1011…
## $ strat_id     <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 10…
## $ psu          <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ byrace       <dbl+lbl> 5, 2, 7, 3, 4, 4, 4, 7, 4, 3, 3, 4, 3, 2, 2, 3, …
## $ byincome     <dbl+lbl> 10, 11, 10, 2, 6, 9, 10, 10, 8, 3, 8, 8, 5, 8, 1…
## $ bypared      <dbl+lbl> 5, 5, 2, 2, 1, 2, 6, 2, 2, 1, 6, 4, 4, 2, 7, 2, …
## $ bynels2m     <dbl+lbl> 47.84, 55.30, 66.24, 35.33, 29.97, 24.28, 45.16,…
## $ bynels2r     <dbl+lbl> 39.04, 36.35, 42.68, 27.86, 13.07, 11.70, 19.66,…
## $ f3attainment <dbl+lbl> 3, 10, 6, 4, 4, 3, 4, 6, -4, 3, 3, 3, 5, 5, 6, -…
## $ f2ps1sec     <dbl+lbl> -8, 1, 1, 4, 4, -3, 4, 2, -4, 4, 1, -4, -4, 4, 2…
## $ f3ern2011    <dbl+lbl> 4000, 3000, 37000, 1500, 48000, 35000, 17000, 68…
## $ f1sex        <dbl+lbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, …
## $ f2evratt     <dbl+lbl> -8, 1, 1, 1, 1, 0, 1, 1, -4, 1, 1, -4, -4, 1, 1,…
## $ f2ps1lvl     <dbl+lbl> -8, 1, 1, 2, 2, -3, 2, 1, -4, 2, 1, -4, -4, 2, 1…
## $ f2ps1ctr     <dbl+lbl> -8, 1, 1, 1, 1, -3, 1, 2, -4, 1, 1, -4, -4, 1, 2…
## $ f2ps1slc     <dbl+lbl> -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, …
els %>% var_label()
## $stu_id
## [1] "Student ID"
## 
## $strat_id
## [1] "Stratum"
## 
## $psu
## [1] "Primary sampling unit"
## 
## $byrace
## [1] "Student's race/ethnicity-composite"
## 
## $byincome
## [1] "Total family income from all sources 2001-composite"
## 
## $bypared
## [1] "Parents' highest level of education"
## 
## $bynels2m
## [1] "ELS-NELS 1992 scale equated sophomore math score"
## 
## $bynels2r
## [1] "ELS-NELS 1992 scale equated sophomore reading score"
## 
## $f3attainment
## [1] "Highest level of education earned as of F3"
## 
## $f2ps1sec
## [1] "Sector of first postsecondary institution"
## 
## $f3ern2011
## [1] "2011 employment income:  R only"
## 
## $f1sex
## [1] "F1 sex-composite"
## 
## $f2evratt
## [1] "Whether has ever attended a postsecondary institution - composite"
## 
## $f2ps1lvl
## [1] "Level of offering of first postsecondary institution"
## 
## $f2ps1ctr
## [1] "Control of first postsecondary institution"
## 
## $f2ps1slc
## [1] "Institutional selectivity of first attended postsecondary institution"

2 Concepts

Basic definitions:

  • Grammar
    • “The fundamental principles or rules of an art or science” (Oxford English Dictionary)
  • Grammar of graphics (Wilkinson, 1999)
    • Principles/rules to describe and construct statistical graphics
  • Layered grammar of graphics (Wickham, 2010)
    • Principles/rules to describe and construct statistical graphics “based around the idea of building up a graphic from multiple layers of data” (Wickham, 2010, p. 4)
    • The layered grammar of graphics is a “formal system for building plots… based on the insight that you can uniquely describe any plot as a combination of” seven parameters (Wickham & Grolemund, 2017, Chapter 3)
  • Aesthetics
    • Aesthetics are visual elements of the plot (e.g., lines, points, symbols, colors, axes)
    • Aesthetic mappings are visual elements of the plot determined by values of specific variables (e.g., a scatterplot where the color of each point depends on the value of the variable race)
    • However, aesthetics need not be determined by variable values. For example, when creating a scatterplot you may specify that the color of each point be blue.
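
The difference between mapping an aesthetic to a variable and setting it to a fixed value can be sketched with the mpg data. This is a minimal illustration; the object names p_mapped and p_fixed are hypothetical.

```r
library(ggplot2)

# Aesthetic mapping: color depends on the value of the variable class,
# so the color argument goes inside aes()
p_mapped <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Fixed aesthetic: every point is blue regardless of the data,
# so the color argument goes outside aes()
p_fixed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

p_mapped
p_fixed
```

Only the mapped version produces a legend, because only there does color carry information about a variable.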

The seven parameters of the layered grammar of graphics consist of:

  • Five layers
    • A dataset (data)
    • A set of aesthetic mappings (mappings)
    • A statistical transformation (stat)
    • A geometric object (geom)
    • A position adjustment (position)
  • A coordinate system (coord)
  • A faceting scheme (facets)

ggplot2 – part of tidyverse – is an R package to create graphics and ggplot() is a function within the ggplot2 package.

“In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.” (Wickham & Grolemund, 2017, Chapter 3)

Syntax conveying the seven parameters of the layered grammar of graphics:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + 
  <GEOM_FUNCTION>(
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
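
As a sketch of how the template can be filled in, the code below spells out all seven parameters for a scatterplot of the mpg data. Every value written here is the default that ggplot2 would supply anyway; the object name p_all is hypothetical.

```r
library(ggplot2)

p_all <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +  # data, mappings
  geom_point(                # geom
    stat = "identity",       # statistical transformation
    position = "identity"    # position adjustment
  ) +
  coord_cartesian() +        # coordinate system
  facet_null()               # faceting scheme (a single panel, i.e., no facets)

p_all
```

In practice this would be written as ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(), with the remaining parameters left at their defaults.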

2.1 Layers

What does Wickham mean by layers? (from “Telling Stories with Data Using the Grammar of Graphics” by Liz Sander)

  • In the grammar of a language, words have different parts of speech (e.g., noun, verb, adjective), with each part of speech performing a different role in a sentence
  • The layered grammar of graphics decomposes a graphic into different layers
    • “These are layers in a literal sense – you can think of them as transparency sheets for an overhead projector, each containing a piece of the graphic, which can be arranged and combined in a variety of ways.”

The five layers of the grammar of graphics:

2.1.1 Dataset (data)

Data defines the information to be visualized.

Example: Imagine a dataset where each observation is a student

  • The variables of interest are high school math test score (bynels2m), earnings in 2011 (f3ern2011), and student sex (f1sex)
els %>% select(stu_id, bynels2m, f3ern2011, f1sex) %>% as_factor() %>% head(10)
## # A tibble: 10 x 4
##    stu_id bynels2m f3ern2011     f1sex 
##     <dbl> <fct>    <fct>         <fct> 
##  1 101101 47.84    4000          Female
##  2 101102 55.3     3000          Female
##  3 101104 66.24    37000         Female
##  4 101105 35.33    1500          Female
##  5 101106 29.97    48000         Female
##  6 101107 24.28    35000         Male  
##  7 101108 45.16    17000         Male  
##  8 101109 66.01    68000         Male  
##  9 101110 28.28    Nonrespondent Male  
## 10 101111 38.85    42000         Male

2.1.2 Set of mappings (mappings)

Mapping defines how variables in a dataset are applied (mapped) to a graphic.

Example: Consider the previous dataset

  • Map HS math test score to the x-axis
  • Map 2011 income to the y-axis
  • Additionally, if we are creating a scatterplot of test score (x-axis) and income (y-axis), we might use sex to define the color of each point
els %>% select(stu_id, bynels2m, f3ern2011, f1sex) %>% 
  rename(x = bynels2m, y = f3ern2011, color = f1sex) %>% 
  as_factor() %>% head(10)
## # A tibble: 10 x 4
##    stu_id x     y             color 
##     <dbl> <fct> <fct>         <fct> 
##  1 101101 47.84 4000          Female
##  2 101102 55.3  3000          Female
##  3 101104 66.24 37000         Female
##  4 101105 35.33 1500          Female
##  5 101106 29.97 48000         Female
##  6 101107 24.28 35000         Male  
##  7 101108 45.16 17000         Male  
##  8 101109 66.01 68000         Male  
##  9 101110 28.28 Nonrespondent Male  
## 10 101111 38.85 42000         Male

2.1.3 Statistical transformation (stat)

A statistical transformation transforms the underlying data before plotting it.

Example: Imagine creating a scatterplot of the relationship between HS math test score (x-axis) and 2011 income (y-axis)

  • When creating a scatterplot we usually do not transform the data prior to plotting
  • This is the “identity” transformation (default for plots like scatterplots)
els %>% select(stu_id,bynels2m,f3ern2011) %>% rename(x=bynels2m, y=f3ern2011) %>% 
  as_factor() %>% head(10)
## # A tibble: 10 x 3
##    stu_id x     y            
##     <dbl> <fct> <fct>        
##  1 101101 47.84 4000         
##  2 101102 55.3  3000         
##  3 101104 66.24 37000        
##  4 101105 35.33 1500         
##  5 101106 29.97 48000        
##  6 101107 24.28 35000        
##  7 101108 45.16 17000        
##  8 101109 66.01 68000        
##  9 101110 28.28 Nonrespondent
## 10 101111 38.85 42000

Example: Imagine creating a bar chart of the number of students by race/ethnicity

  • Here, we do not plot the raw data. Rather, we count the number of observations for each race/ethnicity category.
  • This is the “count” transformation (default for plots like barplots)
els %>% count(byrace) %>% as_factor()
## # A tibble: 9 x 2
##   byrace                                       n
##   <fct>                                    <int>
## 1 Survey component legitimate skip/NA        305
## 2 Nonrespondent                              648
## 3 Amer. Indian/Alaska Native, non-Hispanic   130
## 4 Asian, Hawaii/Pac. Islander,non-Hispanic  1460
## 5 Black or African American, non-Hispanic   2020
## 6 Hispanic, no race specified                996
## 7 Hispanic, race specified                  1221
## 8 More than one race, non-Hispanic           735
## 9 White, non-Hispanic                       8682
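
One way to see the “count” transformation at work is to compare the data a bar chart actually draws with a frequency table of the raw data. The sketch below uses the mpg data (rather than els) so that it runs without downloading anything; layer_data() is a ggplot2 helper that returns the data for a layer after the statistical transformation, and the object names are hypothetical.

```r
library(ggplot2)

p_count <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()   # default stat = "count"

# Data for the bars, after the statistical transformation has been applied
bar_data <- layer_data(p_count)
bar_data[, c("x", "count")]

# The same counts computed directly from the raw data
table(mpg$class)
```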

2.1.4 Geometric objects (geoms)

Graphs visually display data, using geometric objects like a point, line, bar, etc.

  • Each geometric object in a graph is called a “geom”
  • “A geom is the geometrical object that a plot uses to represent data” (Wickham & Grolemund, 2017, Chapter 3)
  • “People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms” (Wickham & Grolemund, 2017, Chapter 3)
  • Aesthetics are “visual attributes of the geom” (e.g., color, fill, shape, position) (Grammar of Graphics)
    • Each geom can only display certain aesthetics
    • For example, a “point geom” supports the aesthetics position (x, y), color, shape, and size, plus a few others (e.g., alpha, fill, stroke)
  • We can plot the same underlying data using different geoms (e.g., bar chart vs. pie chart)
  • A single graph can layer multiple geoms (e.g., scatterplot with a “line of best fit” layered on top)
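
The last two points can be sketched as follows: the same data and mappings are drawn with two different geoms, and then both geoms are layered in one graph. This is an illustrative example using the mpg data; the object names are hypothetical.

```r
library(ggplot2)

# Shared data and mappings
base <- ggplot(data = mpg, mapping = aes(x = drv, y = hwy))

# Same underlying data, different geoms
p_box    <- base + geom_boxplot()
p_violin <- base + geom_violin()

# Two geoms layered in a single graph (narrow boxplots drawn on top of violins)
p_both <- base + geom_violin() + geom_boxplot(width = 0.1)

p_box
p_violin
p_both
```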

2.1.5 Position adjustment (position)

Position adjustment adjusts the position of visual elements in the plot so that these visual elements do not overlap with one another in ways that make the plot difficult to interpret.

Example: The dataset mpg (included in the ggplot2 package) contains variables for the specifications of different cars, with 234 observations

  • Create a scatterplot of the relationship between number of cylinders in the engine (x-axis) and highway miles-per-gallon (y-axis)
  • The plot below is difficult to interpret because many points overlap with one another
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point()

  • The jitter position adjustment “adds a small amount of random variation to the location of each point” (from ?geom_jitter)
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point(position = "jitter")
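
The amount of jitter can also be controlled explicitly with position_jitter(). The sketch below jitters points horizontally only, so that the highway mileage values on the y-axis stay exact; the width and seed values are arbitrary choices for illustration, and the object name p_jit is hypothetical.

```r
library(ggplot2)

p_jit <- ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point(
    position = position_jitter(
      width  = 0.2,   # each point moves at most 0.2 units left or right
      height = 0,     # no vertical jitter
      seed   = 123    # makes the random jitter reproducible
    )
  )

p_jit
```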

2.2 Coordinate system (coord)

“A coordinate system maps the position of objects onto the plane of the plot, and controls how the axes and grid lines are drawn. Plots typically use two coordinates (x,y), but could use any number of coordinates.” (Grammar of Graphics)

Example: Cartesian coordinate system

  • Most plots use the Cartesian coordinate system
x1 <- c(1, 10)
y1 <- c(1, 5)
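# qplot() is deprecated as of ggplot2 3.4.0, so build the blank plot with ggplot()
p <- ggplot(data = data.frame(x = x1, y = y1), mapping = aes(x = x, y = y)) +
  geom_blank() +
  labs(x = NULL, y = NULL) +
  theme_bw()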

p +
  ggtitle(label = "Cartesian coordinate system")

  • Use coord_fixed() to fix the aspect ratio of the coordinate system; by default (ratio = 1), one unit on the x-axis is the same length as one unit on the y-axis
p +
  coord_fixed()

  • When using the default Cartesian coordinate system, a common task is to flip the x and y axes using coord_flip(). (From R for Data Science)
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

Example: Polar coordinate system

p +
  coord_polar() +
  ggtitle(label = "Polar coordinate system")

2.3 Faceting scheme (facets)

Facets are subplots that each display one subset of the data. They are most commonly used to create “small multiples.”

Example: Imagine creating a scatterplot of the relationship between number of cylinders in the engine (x-axis) and highway miles-per-gallon (y-axis), with separate subplots for car class (e.g., midsize, minivan, pickup, suv)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = cyl, y = hwy), position = "jitter") + 
  facet_wrap(~ class, nrow = 2)
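
facet_wrap() wraps the subplots for one variable into rows and columns. When subsets are defined by two variables, facet_grid() lays the subplots out in a grid instead. The sketch below is an illustration, with drv and year chosen arbitrarily and the object name p_grid hypothetical.

```r
library(ggplot2)

p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = hwy), position = "jitter") +
  facet_grid(drv ~ year)   # rows = type of drive train, columns = model year

p_grid
```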

3 Creating graphs using ggplot

3.1 ggplot() and aes() functions

Show help pages for package ggplot2:

help(package = ggplot2)

The ggplot() function:

?ggplot

# SYNTAX AND DEFAULT VALUES
ggplot(data = NULL, mapping = aes())
  • Description (from help file)
    • “ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.”
  • Arguments
    • data: Dataset to use for plot. If not specified in ggplot() function, must be supplied in each layer added to the plot.
    • mapping: Default list of aesthetic mappings to use for plot. If not specified, must be supplied in each layer added to the plot.
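
A minimal sketch of the two options described in the arguments above: data and mapping supplied once in ggplot(), or supplied in the layer itself. The two objects (hypothetical names p_global and p_layer) draw the same scatterplot.

```r
library(ggplot2)

# data and mapping declared in ggplot(); inherited by every layer
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# ggplot() left empty; data and mapping supplied in the layer instead
p_layer <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy))

p_global
p_layer
```

Declaring data and mapping in ggplot() saves typing when several layers share them; declaring them in a layer is useful when layers draw on different datasets or variables.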

The aes() function (often called within the ggplot() function):

?aes

# SYNTAX
aes(x, y, ...)
  • Description (from help file)
    • “Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms. Aesthetic mappings can be set in ggplot() and in individual layers.”
  • Arguments
    • x, y, ...: List of name value pairs giving aesthetics to map to variables
      • The names for x and y aesthetics are typically omitted because they are so common
      • All other aesthetics must be named
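
To illustrate the naming rules above, the sketch below creates the same mapping with and without naming x and y. The object names m1 and m2 are hypothetical. Note that aes() stores the American spelling color under the name colour.

```r
library(ggplot2)

m1 <- aes(x = displ, y = hwy, color = class)  # all aesthetics named
m2 <- aes(displ, hwy, color = class)          # x and y matched by position

names(m1)  # "x" "y" "colour"
names(m2)  # "x" "y" "colour"
```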

Example: Putting ggplot() and aes() together

  • Specifying ggplot() and aes() without specifying a geom layer (e.g., geom_point()) creates a blank ggplot:
ggplot(data = diamonds, aes(x = carat, y = price))

ggplot(data = diamonds, mapping = aes(x = carat, y = price))

  • Alternatively, we can use pipes with the dataframe we want to plot, which allows us to omit the first data argument of ggplot():
class(diamonds)
## [1] "tbl_df"     "tbl"        "data.frame"
diamonds %>% ggplot(mapping = aes(x = carat, y = price))

  • We can also create a ggplot object and assign it to a variable for later use:
diam_ggplot <- ggplot(data = diamonds, aes(x = carat, y = price))

diam_ggplot # blank ggplot

  • Investigate ggplot object:
typeof(diam_ggplot)
## [1] "list"
class(diam_ggplot)
## [1] "gg"     "ggplot"
str(diam_ggplot)
## List of 9
##  $ data       : tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##   ..$ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##   ..$ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##   ..$ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##   ..$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##   ..$ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##   ..$ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##   ..$ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##   ..$ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##   ..$ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##   ..$ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  $ layers     : list()
##  $ scales     :Classes 'ScalesList', 'ggproto', 'gg' <ggproto object: Class ScalesList, gg>
##     add: function
##     clone: function
##     find: function
##     get_scales: function
##     has_scale: function
##     input: function
##     n: function
##     non_position_scales: function
##     scales: NULL
##     super:  <ggproto object: Class ScalesList, gg> 
##  $ mapping    :List of 2
##   ..$ x: language ~carat
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   ..$ y: language ~price
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   ..- attr(*, "class")= chr "uneval"
##  $ theme      : list()
##  $ coordinates:Classes 'CoordCartesian', 'Coord', 'ggproto', 'gg' <ggproto object: Class CoordCartesian, Coord, gg>
##     aspect: function
##     backtransform_range: function
##     clip: on
##     default: TRUE
##     distance: function
##     expand: TRUE
##     is_free: function
##     is_linear: function
##     labels: function
##     limits: list
##     modify_scales: function
##     range: function
##     render_axis_h: function
##     render_axis_v: function
##     render_bg: function
##     render_fg: function
##     setup_data: function
##     setup_layout: function
##     setup_panel_guides: function
##     setup_panel_params: function
##     setup_params: function
##     train_panel_guides: function
##     transform: function
##     super:  <ggproto object: Class CoordCartesian, Coord, gg> 
##  $ facet      :Classes 'FacetNull', 'Facet', 'ggproto', 'gg' <ggproto object: Class FacetNull, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetNull, Facet, gg> 
##  $ plot_env   :<environment: R_GlobalEnv> 
##  $ labels     :List of 2
##   ..$ x: chr "carat"
##   ..$ y: chr "price"
##  - attr(*, "class")= chr [1:2] "gg" "ggplot"
  • Attributes of ggplot object:
attributes(diam_ggplot)
## $names
## [1] "data"        "layers"      "scales"      "mapping"     "theme"      
## [6] "coordinates" "facet"       "plot_env"    "labels"     
## 
## $class
## [1] "gg"     "ggplot"
diam_ggplot$mapping
## Aesthetic mapping: 
## * `x` -> `carat`
## * `y` -> `price`
diam_ggplot$labels
## $x
## [1] "carat"
## 
## $y
## [1] "price"

3.2 Adding geometric layers

Adding a geometric layer to a ggplot object dictates how observations are displayed in the plot.

  • Geometric layers are specified using “geom functions”
  • There are many different geom functions:
    • geom_point(): creates a scatterplot
    • geom_bar(): creates a bar chart
    • etc.

3.2.1 Scatterplots using geom_point()

Scatterplots are most useful for showing the relationship between two continuous variables.

Example: Scatterplot of the relationship between carat and price, using the diamonds dataset

#ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point()
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + geom_point()

  • If we already created and assigned a ggplot object, we can use that object to create the plot:
diam_ggplot + geom_point()

Example: Scatterplot of the relationship between high school math test score (bynels2m) and 2011 earnings (f3ern2011), using the els dataset

  • First, let’s investigate the underlying variables:
els %>% select(bynels2m,f3ern2011) %>%
  summarize_all(.funs = list(~ mean(., na.rm = TRUE), ~ min(., na.rm = TRUE), ~ max(., na.rm = TRUE)))
## # A tibble: 1 x 6
##   bynels2m_mean f3ern2011_mean bynels2m_min f3ern2011_min bynels2m_max
##           <dbl>          <dbl>        <dbl>         <dbl>        <dbl>
## 1          44.3         21276.           -8            -8         79.3
## # … with 1 more variable: f3ern2011_max <dbl>
  • Investigate values less than zero:
els %>% select(bynels2m) %>% filter(bynels2m<0) %>% count(bynels2m)
## # A tibble: 1 x 2
##                                   bynels2m     n
##                                  <dbl+lbl> <int>
## 1 -8 [Survey component legitimate skip/NA]   305
els %>% select(bynels2m) %>% filter(bynels2m<0) %>% count(bynels2m) %>% as_factor()
## # A tibble: 1 x 2
##   bynels2m                                n
##   <fct>                               <int>
## 1 Survey component legitimate skip/NA   305
els %>% select(f3ern2011) %>% filter(f3ern2011<0) %>% count(f3ern2011)
## # A tibble: 2 x 2
##                                  f3ern2011     n
##                                  <dbl+lbl> <int>
## 1 -8 [Survey component legitimate skip/NA]   459
## 2 -4 [Nonrespondent]                        2488
els %>% select(f3ern2011) %>% filter(f3ern2011<0) %>% count(f3ern2011) %>% as_factor()
## # A tibble: 2 x 2
##   f3ern2011                               n
##   <fct>                               <int>
## 1 Survey component legitimate skip/NA   459
## 2 Nonrespondent                        2488
  • Create versions of the variables that replace values less than zero with NA:
els_v2 <- els %>% 
  mutate(
    hs_math = if_else(bynels2m<0,NA_real_,as.numeric(bynels2m)),
    earn2011 = if_else(f3ern2011<0,NA_real_,as.numeric(f3ern2011)),
  )

#check
els_v2 %>% filter(bynels2m<0) %>% count(bynels2m, hs_math)
## # A tibble: 1 x 3
##                                   bynels2m hs_math     n
##                                  <dbl+lbl>   <dbl> <int>
## 1 -8 [Survey component legitimate skip/NA]      NA   305
els_v2 %>% filter(f3ern2011<0) %>% count(f3ern2011, earn2011)
## # A tibble: 2 x 3
##                                  f3ern2011 earn2011     n
##                                  <dbl+lbl>    <dbl> <int>
## 1 -8 [Survey component legitimate skip/NA]       NA   459
## 2 -4 [Nonrespondent]                             NA  2488
els_v2 %>% count(bypared) %>% as_factor()
## # A tibble: 11 x 2
##    bypared                                      n
##    <fct>                                    <int>
##  1 Missing                                     49
##  2 Survey component legitimate skip/NA        179
##  3 Nonrespondent                              648
##  4 Did not finish high school                 944
##  5 Graduated from high school or GED         3053
##  6 Attended 2-year school, no degree         1666
##  7 Graduated from 2-year school              1597
##  8 Attended college, no 4-year degree        1758
##  9 Graduated from college                    3468
## 10 Completed Master's degree or equivalent   1786
## 11 Completed PhD, MD, other advanced degree  1049
  • To avoid a scatterplot with too many points, create a dataframe consisting of students whose parents have a PhD or first professional degree:
els_parphd <- els_v2 %>% filter(bypared==8)
  • Plot the scatterplot:
ggplot(data= els_parphd, aes(x = hs_math, y = earn2011)) + geom_point()


The geom_point() function:

?geom_point

# SYNTAX AND DEFAULT VALUES
geom_point(mapping = NULL, data = NULL, stat = "identity",
           position = "identity", ..., na.rm = FALSE, show.legend = NA,
           inherit.aes = TRUE)
  • Aesthetics: geom_point() understands (i.e., accepts) the following aesthetics (x and y are required)
    • x, y, alpha, colour, fill, group, shape, size, stroke
    • Note: Other geom functions (e.g., geom_bar()) accept a different set of aesthetics
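
Several of these aesthetics are often set to fixed values rather than mapped to variables. As a sketch, the code below makes the points in the diamonds scatterplot small and mostly transparent, which is one common way to make dense regions easier to see. The specific values are arbitrary and the object name p_alpha is hypothetical.

```r
library(ggplot2)

p_alpha <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(
    alpha = 0.1,   # 0 = fully transparent, 1 = fully opaque
    size  = 0.5,   # smaller than the default point size
    shape = 16     # solid circle
  )

p_alpha
```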


Example: Scatterplot of the relationship between engine displacement (displ) and highway miles-per-gallon (hwy), using the mpg dataset

  • Color of points determined by type of car (class):
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) + 
  geom_point()

  • Alternatively, the color aesthetic can be specified within geom_point():
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class))

Student Task: Using the els_parphd dataset, create a scatterplot of the relationship between HS math score (hs_math) on the x-axis and 2011 earnings (earn2011) on the y-axis, with the color of points determined by sex (f1sex)

Solution
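  • The code below doesn’t work as intended because f1sex is stored as a labelled numeric vector (<dbl+lbl>), not a factor, so ggplot2 does not treat it as a categorical variable for the color aesthetic: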
ggplot(data= els_parphd, aes(x = hs_math, y = earn2011, color = f1sex)) + geom_point()
  • This works:
ggplot(data= els_parphd, aes(x = hs_math, y = earn2011, color = as_factor(f1sex))) + geom_point()
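
What as_factor() does to a labelled variable can be sketched with a toy vector. The vector sex below is hypothetical, but it uses the same coding as f1sex shown earlier (1 = Male, 2 = Female).

```r
library(haven)

# Toy labelled vector mimicking how f1sex is stored (<dbl+lbl>)
sex <- labelled(c(2, 2, 1, 1), labels = c(Male = 1, Female = 2))
class(sex)      # includes "haven_labelled"

# as_factor() replaces the numeric codes with their value labels
sex_fct <- as_factor(sex)
class(sex_fct)  # "factor"
levels(sex_fct) # "Male" "Female"
```

Because the result is an ordinary factor, ggplot2 assigns it a discrete color scale with one color per level.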


3.2.2 Smoothed prediction lines using geom_smooth()

Why use geom_smooth()?

  • The biggest problem with scatterplots is “overplotting.” That is, when you plot many observations, points may be plotted on top of one another and it becomes difficult to visually discern the relationship:
ggplot(data = els_v2, aes(x = hs_math, y = earn2011)) + geom_point()

  • Instead, using geom_smooth() creates smoothed prediction lines with shaded confidence intervals:
ggplot(data = els_v2, aes(x = hs_math, y = earn2011)) + geom_smooth()


The geom_smooth() function:

?geom_smooth

# SYNTAX AND DEFAULT VALUES
geom_smooth(mapping = NULL, data = NULL, stat = "smooth",
            position = "identity", ..., method = "auto", formula = y ~ x,
            se = TRUE, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
  • Arguments
    • Note default “statistical transformation” (stat), as compared to that of geom_point():
      • stat = "smooth" for geom_smooth()
      • stat = "identity" for geom_point()
  • Aesthetics: geom_smooth() accepts the following aesthetics (x and y are required)
    • x, y, alpha, colour, fill, group, linetype, size, weight, ymax, ymin
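
The method, formula, and se arguments can be set explicitly. As a sketch using the mpg data, the code below replaces the automatically chosen smoother with a straight line of best fit from a linear model and turns off the confidence band. The object name p_lm is hypothetical.

```r
library(ggplot2)

p_lm <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(
    method  = "lm",    # fit a linear model instead of the automatic smoother
    formula = y ~ x,   # stated explicitly to avoid the message about defaults
    se      = FALSE    # do not draw the shaded confidence interval
  )

p_lm
```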


Example: Smoothed prediction lines for high school math test score (hs_math) versus 2011 earnings (earn2011), using the els_v2 dataset

  • This code produces the same plot as above, where the aesthetics were specified in ggplot():
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011))

  • Use group aesthetic to create separate prediction lines by sex (f1sex):
#ggplot(data=els_v2, aes(x = hs_math, y = earn2011, group=as_factor(f1sex))) + geom_smooth()
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011, group=as_factor(f1sex)))

  • Use linetype aesthetic to create separate prediction lines (with different line styles) by sex (f1sex):
#ggplot(data=els_v2, aes(x = hs_math, y = earn2011, linetype=as_factor(f1sex))) + geom_smooth()
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011, linetype=as_factor(f1sex)))

  • Use color aesthetic to create separate prediction lines (with different colors) by sex (f1sex):
#ggplot(data=els_v2, aes(x = hs_math, y = earn2011, color=as_factor(f1sex))) + geom_smooth()
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011, color=as_factor(f1sex)))

3.2.3 Plotting multiple geom layers

Example: Layer smoothed prediction lines (geom_smooth()) on top of scatterplot (geom_point())

ggplot(data= els_v2) + 
  geom_point(mapping = aes(x = hs_math, y = earn2011)) + 
  geom_smooth(mapping = aes(x = hs_math, y = earn2011))

  • Equivalently, the same plot can be created using this syntax:
ggplot(data= els_v2, aes(x = hs_math, y = earn2011)) + 
  geom_point() +
  geom_smooth()

  • Adjust x-axis and y-axis limits by using + xlim() and + ylim():
ggplot(data= els_v2, aes(x = hs_math, y = earn2011)) + 
  geom_point() +
  geom_smooth() +
  xlim(c(20,80)) + ylim(c(0,100000))

  • Layer smoothed prediction lines with different line types by sex (f1sex) on top of scatterplot with different point colors by sex:
ggplot(data= els_v2) + 
  geom_point(mapping = aes(x = hs_math, y = earn2011, color = as_factor(f1sex))) + 
  geom_smooth(mapping = aes(x = hs_math, y = earn2011, linetype = as_factor(f1sex))) +
  xlim(c(20,80)) + ylim(c(0,100000))
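
One caveat when combining xlim() or ylim() with geom_smooth(): these functions set the limits of the scale, so observations outside the limits are dropped (with a warning) before the smoothed line is fit. To zoom in without changing the fit, the limits can be passed to coord_cartesian() instead. The sketch below uses the mpg data with arbitrary limits; the object names are hypothetical.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# ylim() sets scale limits: cars with hwy outside [20, 45] are removed
# before the line is fit, so the fitted line changes
p_clip <- base + ylim(c(20, 45))

# coord_cartesian() only zooms the view: the line is fit to all observations
p_zoom <- base + coord_cartesian(ylim = c(20, 45))

p_clip
p_zoom
```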

3.2.4 Bar charts using geom_bar() and geom_col()

Bar charts are used to plot a single, discrete variable.

  • X-axis typically represents a categorical variable (e.g., race, sex, institutional type)
    • Each value of the categorical variable is a “group”
  • Y-axis often represents the number of cases in a group (or the proportion of cases in a group)
    • But height of bar could also represent mean value for a group or some other summary statistic (e.g., min, max, standard deviation)

Two geom functions to create bar charts:

  • geom_bar(): The height of each bar represents the number of cases (i.e., observations) in the group
    • Statistical transformation = “count”
      • Y-value for a group is the number of cases in the group
    • Use geom_bar() when using (for example) student-level data and you don’t want to summarize student-level data prior to creating the chart
  • geom_col(): The height of each bar represents the value of some variable for the group
    • Statistical transformation = “identity”
      • Y-value for a group is the value of a variable in the dataframe
    • Use geom_col() when you have already created an object of summary statistics (e.g., counts, mean value, etc.)
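
As a minimal sketch of the geom_col() case where bar height is a summary statistic rather than a count, the code below computes the mean price for each cut in the diamonds dataset and plots it. The names price_by_cut and mean_price are illustrative choices, not part of the original lecture.

```r
library(dplyr)
library(ggplot2)

# Summarize first: one row per group, one column holding the statistic
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(mean_price = mean(price))

# geom_col() uses the y values as-is (stat = "identity")
p_mean <- ggplot(data = price_by_cut, aes(x = cut, y = mean_price)) +
  geom_col()

p_mean
```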


The geom_bar() and geom_col() functions:

?geom_bar

# SYNTAX AND DEFAULT VALUES
geom_bar(mapping = NULL, data = NULL, stat = "count",
         position = "stack", ..., width = NULL, binwidth = NULL,
         na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)


?geom_col

# SYNTAX AND DEFAULT VALUES
geom_col(mapping = NULL, data = NULL, position = "stack", ...,
         width = NULL, na.rm = FALSE, show.legend = NA,
         inherit.aes = TRUE)


Example: Bar chart with the variable cut (e.g., “Fair,” “Good,” “Ideal”) as x-axis and number of diamonds as y-axis, using the diamonds dataset

  • Essentially, you are being asked to create a bar chart from the following frequency count:
diamonds %>% count(cut)
## # A tibble: 5 x 2
##   cut           n
##   <ord>     <int>
## 1 Fair       1610
## 2 Good       4906
## 3 Very Good 12082
## 4 Premium   13791
## 5 Ideal     21551

Method 1: Create bar chart using geom_bar()

ggplot(data = diamonds, aes(x = cut)) +
  geom_bar()

Method 2: Create bar chart using geom_col()

  • First, create an object of frequency count for the variable cut:
cut_count <- diamonds %>% count(cut)
cut_count
## # A tibble: 5 x 2
##   cut           n
##   <ord>     <int>
## 1 Fair       1610
## 2 Good       4906
## 3 Very Good 12082
## 4 Premium   13791
## 5 Ideal     21551
  • Next, use ggplot() + geom_col() to plot the data from the object cut_count:
ggplot(data = cut_count, aes(x = cut, y=n)) +
  geom_col()

  • Alternatively, we can use pipes to create the plot without creating a separate cut_count object first:
# Inspect the structure of the piped frequency count before plotting
diamonds %>% count(cut) %>% str()
## tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
##  $ cut: Ord.factor w/ 5 levels "Fair"<"Good"<..: 1 2 3 4 5
##  $ n  : int [1:5] 1610 4906 12082 13791 21551
diamonds %>% count(cut) %>% ggplot(aes(x= cut, y=n)) + 
  geom_col()
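
The introduction to this section notes that the y-axis can also show the proportion of cases in each group. One way to sketch this with geom_bar() is to map y to the prop variable computed by the count statistic. This assumes ggplot2 3.3.0 or later, where after_stat() is available. Setting group = 1 tells the statistic to compute proportions across all bars rather than within each bar.

```r
library(ggplot2)

p_prop <- ggplot(data = diamonds, aes(x = cut)) +
  # prop is computed by stat = "count"; group = 1 makes the bars sum to 1
  geom_bar(aes(y = after_stat(prop), group = 1))

# Heights of the five bars, as computed by the statistic
bar_heights <- layer_data(p_prop, 1)$y

p_prop
```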

Student Task: Using the els_v2 dataset, create a bar chart with the variable “ever attended postsecondary education” (f2evratt) as x-axis and number of students as y-axis

Solution
  • Essentially, you are being asked to create a bar chart from the following frequency count:
els_v2 %>% count(f2evratt) %>% as_factor()
## # A tibble: 5 x 2
##   f2evratt                                n
##   <fct>                               <int>
## 1 Survey component legitimate skip/NA   359
## 2 Nonrespondent                        1691
## 3 Item legitimate skip/NA               108
## 4 No                                   3505
## 5 Yes                                 10534

Method 1: Create bar chart using geom_bar()

ggplot(data = els_v2, aes(x = as_factor(f2evratt))) +
  geom_bar()

  • Additionally, we can use pipes to keep only the non-negative values of f2evratt, which drops the skip and nonrespondent categories, before plotting:
els_v2 %>% filter(f2evratt>=0) %>% ggplot(aes(x = as_factor(f2evratt))) +
  geom_bar()

Method 2: Create bar chart using geom_col()

els_v2 %>% 
  # filter to remove missing values
  filter(f2evratt>=0) %>% 
  # use count() to create summary statistics object
  count(f2evratt) %>%
  # plot summary statistic object
  ggplot(aes(x=as_factor(f2evratt), y=n)) + geom_col()


3.3 Small multiples using faceting

Facets divide a plot into subplots based on the values of one or more discrete variables. They are most commonly used to create “small multiples.”

Two functions to split your plots into facets:

  • facet_grid(): Display subplots in grid format, where rows and columns are determined by the faceting variable(s)
    • facet_grid() is most useful when you have two discrete variables, and all combinations of the variables exist in the data
  • facet_wrap(): Display subplots side-by-side as a single sequence of panels that can wrap onto multiple rows
    • facet_wrap() generally makes better use of screen space, and you can specify the number of plots in each row or column


The facet_grid() and facet_wrap() functions:

?facet_grid

# SYNTAX AND DEFAULT VALUES
facet_grid(rows = NULL, cols = NULL, scales = "fixed",
  space = "fixed", shrink = TRUE, labeller = "label_value",
  as.table = TRUE, switch = NULL, drop = TRUE, margins = FALSE,
  facets = NULL)

?facet_wrap

# SYNTAX AND DEFAULT VALUES
facet_wrap(facets, nrow = NULL, ncol = NULL, scales = "fixed",
  shrink = TRUE, labeller = "label_value", as.table = TRUE,
  switch = NULL, drop = TRUE, dir = "h", strip.position = "top")

Specifying which variable(s) to facet your plot on:

  • facet_grid()
    • Since facet_grid() arranges subplots in a grid format, we need to specify how we define the rows and columns
    • One way to do this is passing in the rows and cols arguments, which should be variables quoted by vars()
      • facet_grid(rows = vars(<var_1>), cols = vars(<var_2>)): facet into both rows and columns
      • facet_grid(rows = vars(<var_1>)): facet into rows only
      • facet_grid(cols = vars(<var_1>)): facet into columns only
    • Alternatively, we can pass in a formula, which has the syntax <row_var> ~ <col_var>
      • facet_grid(<var_1> ~ <var_2>): facet into both rows and columns
      • facet_grid(<var_1> ~ .): facet into rows only
      • facet_grid(. ~ <var_1>): facet into columns only
  • facet_wrap()
    • facet_wrap() also accepts a formula for its facets argument
      • facet_wrap(~ <var_1>): facet by one variable
      • facet_wrap(<var_1> ~ <var_2>): facet on the combination of two variables

3.3.1 Faceting by one variable

Example: Scatterplot of the relationship between engine displacement (displ) and highway miles-per-gallon (hwy), faceted by number of cylinders (cyl), from the mpg dataset

Method 1: Faceting using facet_grid()

  • For one variable, you can choose to facet into rows or columns:
# Facet into rows
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(rows = vars(cyl))

# Facet into columns
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(cols = vars(cyl))

  • Alternatively, we could specify the input as a formula to get the same results:
# Facet into rows
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(cyl ~ .)

# Facet into columns
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(. ~ cyl)

Method 2: Faceting using facet_wrap()

  • Unlike facet_grid(), facet_wrap() is not restricted to either rows or columns:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ cyl)

  • But we are free to set the number of rows or columns if we want to:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ cyl, nrow = 1)

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ cyl, ncol = 1)

3.3.2 Faceting by two variables

Example: Scatterplot of the relationship between engine displacement (displ) and highway miles-per-gallon (hwy), faceted by number of cylinders (cyl) and type of car (class), from the mpg dataset

Method 1: Faceting using facet_grid()

  • For example, we can make the rows based on cyl and the columns based on class:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(rows = vars(cyl), cols = vars(class))

  • Alternatively, we could specify the input as a formula to get the same results:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(cyl ~ class)

Method 2: Faceting using facet_wrap()

  • Since facet_wrap() is not defined by rows and columns, it omits any subplots that do not display any data:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(cyl ~ class)

  • We are also free to choose the number of rows or columns to display:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(cyl ~ class, nrow = 3)

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(cyl ~ class, ncol = 4)
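
To see why facet_wrap() shows fewer subplots than facet_grid() here, it can help to count the combinations of cyl and class that actually occur in mpg. The sketch below does that and compares the result with the full grid. The object names are illustrative.

```r
library(dplyr)
library(ggplot2)

# Combinations of cyl and class that actually occur in the data
existing_combos <- mpg %>% distinct(cyl, class)
n_wrap_panels <- nrow(existing_combos)

# facet_grid() draws every row-by-column cell, whether or not it has data
n_grid_panels <- n_distinct(mpg$cyl) * n_distinct(mpg$class)

# Cross-check against the panels that facet_wrap() draws
p_wrap <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(cyl ~ class)
n_drawn <- length(unique(layer_data(p_wrap, 1)$PANEL))

c(wrap = n_wrap_panels, grid = n_grid_panels, drawn = n_drawn)
```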

4 Customization

There are many ways to customize the display of our plot. For this section, we will build upon this scatterplot we saw earlier:

diamonds %>% 
  ggplot(mapping = aes(x = carat, y = price)) + 
  geom_point()

4.1 Labels

Functions to add title and axis labels:

  • ggtitle(): Add title of graph
  • xlab(): Add x-axis label
  • ylab(): Add y-axis label

Example: Adding title and axis labels

diamonds %>% 
  ggplot(mapping = aes(x = carat, y = price)) + 
  geom_point() +
  ggtitle('Correlation between diamond carat and price') +
  xlab('Carat') + ylab('Price')

4.2 Scales

The scale_x_continuous() and scale_y_continuous() functions:

?scale_x_continuous
?scale_y_continuous

# SYNTAX AND DEFAULT VALUES
scale_x_continuous(
  name = waiver(),
  breaks = waiver(),
  minor_breaks = waiver(),
  n.breaks = NULL,
  labels = waiver(),
  limits = NULL,
  expand = waiver(),
  oob = censor,
  na.value = NA_real_,
  trans = "identity",
  guide = waiver(),
  position = "bottom",
  sec.axis = waiver()
)

scale_y_continuous(
  name = waiver(),
  breaks = waiver(),
  minor_breaks = waiver(),
  n.breaks = NULL,
  labels = waiver(),
  limits = NULL,
  expand = waiver(),
  oob = censor,
  na.value = NA_real_,
  trans = "identity",
  guide = waiver(),
  position = "left",
  sec.axis = waiver()
)
  • Description (from help file)
    • “scale_x_continuous() and scale_y_continuous() are the default scales for continuous x and y aesthetics.”
  • Arguments
    • name: The name of the scale. Used as the axis or legend title.
    • labels: Custom labelling of the scales (i.e., ticks)
    • limits: Limits of the scale (i.e., min/max values)
    • position: The position of the axis. ('left' or 'right' for y axes, 'top' or 'bottom' for x axes)


The label_number() function:

?label_number

# SYNTAX AND DEFAULT VALUES
label_number(
  accuracy = NULL,
  scale = 1,
  prefix = "",
  suffix = "",
  big.mark = " ",
  decimal.mark = ".",
  trim = TRUE,
  ...
)
  • Description (from help file)
    • “Use label_number() to force decimal display of numbers (i.e. don’t use scientific notation)”
  • Arguments
    • accuracy: A number to round to (e.g. use 0.01 to show 2 decimal places of precision)
    • scale: A scaling factor (i.e., x will be multiplied by scale before formatting)
    • prefix: Symbols to display before value
    • suffix: Symbols to display after value


Example: Formatting numbers on the y-axis

We can use scale_y_continuous(), in conjunction with label_number() from the scales package, to format the numbers on the y-axis:

  • Use prefix to add $ before the number
  • Use suffix to add K after the number
  • Use scale of 1e-3 to divide number by 1000
  • Use accuracy of 1 to round number to the ones digit
diamonds %>% 
  ggplot(mapping = aes(x = carat, y = price)) + 
  geom_point() +
  ggtitle('Correlation between diamond carat and price') +
  xlab('Carat') + ylab('Price') +
  scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1))
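
It may help to see that label_number() returns a function, which ggplot2 then applies to the axis breaks. The sketch below calls that function directly on a few made-up price values. The name dollar_k and the input values are illustrative.

```r
library(scales)

# label_number() builds a formatting function; it does not format anything yet
dollar_k <- label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1)

# Apply the formatter to raw values, as the y-axis would
formatted <- dollar_k(c(5000, 10000, 15000))
formatted
# "$5K"  "$10K" "$15K"
```

Trying a formatter on a small vector like this is a quick way to check prefix, suffix, scale, and accuracy before adding it to a plot.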

4.3 Colors

There are several ways to customize the color palette of a plot, including scale_color_brewer() for discrete scales and scale_color_gradient() for continuous scales.


The scale_color_brewer() function:

?scale_color_brewer

# SYNTAX AND DEFAULT VALUES
scale_color_brewer(
  ...,
  type = "seq",
  palette = 1,
  direction = 1,
  aesthetics = "colour"
)
  • Description (from help file)
    • “The brewer scales provide sequential, diverging and qualitative colour schemes from ColorBrewer”
  • Arguments
    • palette: Name of the color palette (e.g., 'Spectral')
    • direction: 1 for default ordering, -1 for reverse ordering
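
To look up which palette names are available, one option is to query the RColorBrewer package directly. This sketch assumes RColorBrewer is installed, which is normally the case because the scales package depends on it. It lists the palette names and previews seven colors from the 'Spectral' palette, one for each level of the diamonds color variable.

```r
library(RColorBrewer)
library(scales)

# brewer.pal.info has one row per palette; row names are the palette names
palette_info <- brewer.pal.info
palette_names <- rownames(palette_info)

# Number of palettes in each category (div, qual, seq)
table(palette_info$category)

# Hex codes of a 7-color 'Spectral' palette
spectral_7 <- brewer_pal(palette = 'Spectral')(7)
spectral_7
```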


Example: Customizing color palette of discrete scale

Let’s color the points by the diamond color. This is the default display:

diamonds %>% 
  ggplot(mapping = aes(x = carat, y = price, color = color)) + 
  geom_point() +
  ggtitle('Correlation between diamond carat and price') +
  xlab('Carat') + ylab('Price') +
  scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1))

We can use scale_color_brewer() to customize the color palette. It also accepts labeling arguments, such as name to specify the legend title:

diamonds %>% 
  ggplot(mapping = aes(x = carat, y = price, color = color)) + 
  geom_point() +
  ggtitle('Correlation between diamond carat and price') +
  xlab('Carat') + ylab('Price') +
  scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1)) +
  scale_color_brewer(palette = 'Spectral', name = 'Color')


The scale_color_gradient() function:

?scale_color_gradient

# SYNTAX AND DEFAULT VALUES
scale_color_gradient(
  ...,
  low = "#132B43",
  high = "#56B1F7",
  space = "Lab",
  na.value = "grey50",
  guide = "colourbar",
  aesthetics = "colour"
)
  • Description (from help file)
    • “scale_*_gradient creates a two colour gradient (low-high)”
  • Arguments
    • low: Color for low end of the gradient
    • high: Color for high end of the gradient
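
As a rough illustration of what low and high control, the sketch below builds a two-color gradient with seq_gradient_pal() from the scales package and evaluates it at the bottom, middle, and top of the range. This is a stand-alone approximation for intuition, not a claim about how scale_color_gradient() is implemented internally. The colors and object names are illustrative.

```r
library(scales)

# A palette function that interpolates between two colors
white_to_purple <- seq_gradient_pal(low = 'white', high = 'purple')

# Inputs are positions on a 0-1 scale: 0 = low end, 1 = high end
gradient_colors <- white_to_purple(c(0, 0.5, 1))
gradient_colors
```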


Example: Customizing color palette of continuous scale

Let’s color the points by the diamond depth percentage. This is the default display:

diamonds %>% 
  ggplot(mapping = aes(x = carat, y = price, color = depth)) + 
  geom_point() +
  ggtitle('Correlation between diamond carat and price') +
  xlab('Carat') + ylab('Price') +
  scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1))

We can use scale_color_gradient() to customize the color palette. It also accepts labeling arguments, such as name to specify the legend title and labels to customize the legend values:

diamonds %>% 
  ggplot(mapping = aes(x = carat, y = price, color = depth)) + 
  geom_point() +
  ggtitle('Correlation between diamond carat and price') +
  xlab('Carat') + ylab('Price') +
  scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1)) +
  scale_color_gradient(low = 'white', high = 'purple', name = 'Depth percentage', 
                       labels = label_number(suffix = '%'))

4.4 Themes

To customize the display of the plot, ggplot2 offers several preset themes, including:

  • theme_grey() (default)
  • theme_bw()
  • theme_light()
  • theme_dark()
  • theme_minimal()
  • theme_classic()

Example: Using preset theme

diamonds %>% 
  ggplot(mapping = aes(x = carat, y = price, color = color)) + 
  geom_point() +
  ggtitle('Correlation between diamond carat and price') +
  xlab('Carat') + ylab('Price') +
  scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1)) +
  scale_color_brewer(palette = 'Spectral', name = 'Color') +
  theme_minimal()


We can also use theme() to customize specific components of the plot.

Example: Using custom theme

diamonds %>% 
  ggplot(mapping = aes(x = carat, y = price, color = color)) + 
  geom_point() +
  ggtitle('Correlation between diamond carat and price') +
  xlab('Carat') + ylab('Price') +
  scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1)) +
  scale_color_brewer(palette = 'Spectral', name = 'Color') +
  theme(
    text = element_text(size = 8),
    panel.background = element_blank(),
    plot.title = element_text(color = '#444444', size = 10, hjust = 0.5, face = 'bold'),
    axis.ticks = element_blank(),
    axis.title = element_text(face = 'bold'),
    legend.title = element_text(face = 'bold'),
    legend.key = element_blank(),
    legend.key.size = unit(0.5, 'cm')
  )

5 Exporting plots

The plots generated by ggplot can be exported as a PDF, PNG, or other file types. (From Creating and Saving Graphs - R Base Graphs)

5.1 Exporting in RStudio

In RStudio, the generated plots will typically be displayed in the lower right panel. There is an Export button that allows you to save the plot as a PDF or PNG.

5.2 Exporting via R code

There are also various R functions, including jpeg(), png(), svg(), and pdf(), for exporting plots.

The steps for saving a plot:

  • Use one of the R functions to open a file
    • Optional arguments include height and width for specifying image dimension
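  • Create the plot (inside a function, loop, or sourced script, wrap the plot in print() so that it is actually drawn to the file)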
  • Close the file with dev.off()

Example: Exporting plot using pdf()

# Open the file
pdf('Rplot.pdf')

# Create the plot
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) + 
  geom_point()

# Close the file
dev.off()

Example: Exporting plot using jpeg()

# Open the file
jpeg('Rplot.jpg', width = 350, height = 350)

# Create the plot
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) + 
  geom_point()

# Close the file
dev.off()
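
The open, plot, close steps can also be wrapped in a small helper. The sketch below is one hypothetical way to do this. The function name save_plot_pdf and its arguments are illustrative, not part of ggplot2. It calls print() explicitly because ggplot objects are not drawn automatically inside a function, and it uses on.exit() so the file is closed even if plotting fails.

```r
library(ggplot2)

# Hypothetical helper: write a ggplot object to a PDF file
save_plot_pdf <- function(plot, path, width = 7, height = 7) {
  pdf(path, width = width, height = height)  # open the file
  on.exit(dev.off(), add = TRUE)             # always close the file
  print(plot)                                # draw the plot explicitly
  invisible(path)
}

p <- ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Write to a temporary file for demonstration purposes
out_file <- save_plot_pdf(p, tempfile(fileext = '.pdf'))
file.exists(out_file)
```

ggplot2 also provides ggsave(), which picks the graphics device from the file extension and may be more convenient for single plots.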

References

Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28. https://doi.org/10.1198/jcgs.2009.07098
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media, Inc. Retrieved from https://r4ds.had.co.nz/
Wilkinson, L. (1999). The grammar of graphics. New York: Springer.